Flash memory cache including for use with persistent key-value store

ABSTRACT

Described is using flash memory, RAM-based data structures and mechanisms to provide a flash store for caching data items (e.g., key-value pairs) in flash pages. A RAM-based index maps data items to flash pages, and a RAM-based write buffer maintains data items to be written to the flash store, e.g., when a full page can be written. A recycle mechanism makes used pages in the flash store available by destaging a data item to a hard disk or reinserting it into the write buffer, based on its access pattern. The flash store may be used in a data deduplication system, in which the data items comprise chunk-identifier, metadata pairs, in which each chunk-identifier corresponds to a hash of a chunk of data. The RAM and flash are accessed with the chunk-identifier (e.g., as a key) to determine whether a chunk is a new chunk or a duplicate.

BACKGROUND

Flash media has advantages over RAM and hard disk storage, namely that unlike RAM, flash media is persistent, and unlike hard disk, flash media provides much faster data access times, e.g., on the order of hundreds or thousands of times faster than hard disk access. Many applications thus may benefit from the use of flash media.

However, flash media is expensive, at present costing ten to twenty times more per gigabyte than hard disk storage. Further, flash devices are subject to reduced lifetimes due to page wearing, whereby small random writes (that also have relatively high latency) are not desirable. What is needed is a technology for using flash media that provides high performance, while factoring in cost considerations, efficiency and flash media lifetimes.

SUMMARY

This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.

Briefly, various aspects of the subject matter described herein are directed towards a technology by which a flash memory is configured into a secondary storage device (e.g., a flash store and/or a flash store and a disk-based device) via RAM-based data structures and mechanisms so as to maintain a cache of data items (e.g., key-value pairs) in flash pages. A RAM-based index maps each data item in the flash store to the page in which that data item is maintained, and a RAM-based write buffer maintains data items to be written to the flash store. A mechanism (e.g., one or more threads) uses the RAM-based index to locate data items in the flash store, and to write data items from the RAM-based write buffer to the flash store. The write may occur when the data items fill a page, or when a coalesce time is reached.

In one aspect, the flash store serves as a cache between RAM and a hard disk store. The mechanism looks for a data item in a RAM-based read/write cache (e.g., comprising a RAM-based read cache and the RAM-based write buffer) before using the RAM-based index to locate data items in the flash store. A recycle mechanism makes a page in the flash store available by processing valid data items on the page, including destaging a data item from the page in the flash store to the hard disk store or reinserting the data item into the write buffer, based on whether the information indicates that the data item has been recently accessed. A data structure (e.g., a bloom filter pair) is used to track (to a high probability) whether a data item has been recently accessed. Another data structure (a bloom filter) indicates to a high probability whether a data item has been destaged to the hard disk store.

In one aspect, the flash store is used in conjunction with RAM in a data deduplication system. The data items comprise chunk-identifier, metadata pairs, in which each chunk-identifier is representative of a hash of a chunk of data, which is used to determine whether that chunk is a duplicate of another chunk of data. The chunks are maintained in containers. If the chunk-identifier is in the flash store, chunks of a container corresponding to that chunk identifier are prefetched into the RAM cache. If the chunk identifier is not in the RAM cache, the RAM-based write-buffer, or the flash store, the chunk identifier is deemed to represent a new chunk, and the data of that chunk is added to a container, with a chunk identifier, metadata pair for that chunk added to the RAM-based write-buffer.

Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:

FIG. 1 is a block diagram representing an example architecture and data structures for using flash media as a cache between RAM and hard drive storage.

FIGS. 2 and 3 comprise a flow diagram representing example steps for looking up a key of a key-value pair in RAM, flash memory or a hard drive as needed.

FIG. 4 is a flow diagram representing example steps for handling insertion of a key into a flash-based architecture.

FIG. 5 is a flow diagram representing example steps for recycling pages of flash when pages are needed for storage.

FIG. 6 is a block diagram representing an example architecture and data structures for a deduplication system that uses flash media as a cache between RAM and hard drive storage.

FIG. 7 is a flow diagram representing example steps taken by a flash-based deduplication system to handle chunks of incoming data.

FIG. 8 shows an illustrative example of a computing environment into which various aspects of the present invention may be incorporated.

DETAILED DESCRIPTION

Various aspects of the technology described herein are generally directed towards using flash media as a cache between RAM and hard disk storage. In general, various data structures and mechanisms (e.g., algorithms) suitable for a given application allow data items such as key-value pairs to be efficiently looked up and/or inserted while stored on RAM or flash memory, in a manner that substantially reduces or avoids unnecessary hard disk access. One example implementation described herein maintains key-value pairs and provides efficient key lookup and insert operations, including based upon predetermined tradeoffs between performance and cost. Another example implementation provides an efficient and cost-effective system for facilitating data deduplication operations.

It should be understood that any of the examples herein are non-limiting. Indeed, the technology described herein applies to any type of non-volatile storage that is faster than disk access, not only the flash media described herein. Moreover, the data structures described herein are only examples of ways to use a cache according to the technology described herein. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and data retrieval in general.

FIG. 1 shows example architectural components of one implementation of a key-value store maintained among relatively very fast RAM 102, relatively fast non-volatile storage (“flash store” 104) and a relatively slow hard disk data store 106. The hard disk data store 106 is in general significantly slower with respect to data access than the flash store 104, and may be maintained on any suitable hard disk device, whether local or remote, and regardless of how many hard disks and/or other mechanisms make up the hard disk device.

A RAM write buffer 108 comprising a data structure (e.g., of fixed size) maintained in the RAM 102 buffers data item writes such that a write is made to the flash store 104 only in a controlled manner, e.g., when there is enough data to fill a flash page (which is typically 2 KB or 4 KB in size, and is known in advance). As used in the example of FIG. 1 and for purposes of the example description herein, the data items comprise key-value pairs; however, any suitable data item may be used with the technology described herein.

The flash store 104 provides persistent storage for the key-value pairs and may be organized as a recycled append log, in which the pages on flash are maintained implicitly as a circular linked list. Because the flash translation layer (FTL) translates logical page numbers to physical ones, it is straightforward to implement the circular linked list as a contiguous block of logical page addresses with wraparound. This may be realized by two page number variables, one for the first valid page (oldest written) and the other for the last valid page (most recently written). Note that FIG. 1 represents valid pages (containing maintained data) as non-shaded, and invalid pages (available for use) as shaded. In one implementation, each flash page begins with a header portion that contains metadata information including the time when the page was written, the number of key-value pairs in the page, and the beginning offset for each.
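
The sketch below illustrates one way such a recycled append log and per-page header might look. The header layout, field widths and 4 KB page size are illustrative assumptions for this sketch, not a layout specified by the description above.

```python
import struct
import time

PAGE_SIZE = 4096                      # assumed 4 KB flash page
HEADER_FMT = "<dI"                    # write time, number of key-value pairs
# per-pair beginning offsets follow the fixed header as unsigned shorts

def build_page(pairs):
    """Serialize a set of key-value byte-string pairs into one page image."""
    offsets, body = [], b""
    header_len = struct.calcsize(HEADER_FMT) + 2 * len(pairs)
    for key, value in pairs:
        offsets.append(header_len + len(body))
        body += struct.pack("<HH", len(key), len(value)) + key + value
    page = struct.pack(HEADER_FMT, time.time(), len(pairs))
    page += b"".join(struct.pack("<H", off) for off in offsets) + body
    assert len(page) <= PAGE_SIZE
    return page.ljust(PAGE_SIZE, b"\xff")

class CircularLog:
    """Contiguous block of logical page numbers used with wraparound,
    tracked by a first-valid (oldest) and next-free page number."""
    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.first_valid = 0          # oldest written page
        self.next_free = 0            # where the next append goes
        self.used = 0

    def append(self, page_image):
        if self.used == self.num_pages:
            raise RuntimeError("log full; recycle the oldest page first")
        page_no = self.next_free
        # a device write of page_image to logical page page_no would go here
        self.next_free = (self.next_free + 1) % self.num_pages
        self.used += 1
        return page_no

    def recycle_oldest(self):
        page_no = self.first_valid
        self.first_valid = (self.first_valid + 1) % self.num_pages
        self.used -= 1
        return page_no
```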

In one implementation, the key-value pairs are written to flash in units of a page size containing a set of pairs. Each key-value entry in the flash store 104 also has an associated write operation timestamp. To achieve desired persistency considerations, writes to the flash store 104 also may be made when a pre-specified coalesce time interval is reached, as described below. In general, the RAM write buffer is sized two to three times the flash page size so that key-value writes can still occur when another part of the RAM write buffer 108 is being written to the flash store 104.

In one implementation, a RAM hash table index 110 provides an index structure for locating the key-value pairs stored on the flash store 104. The hash table index 110 is maintained in RAM and is organized as a hash table having pointers to the full key-value pairs stored on the flash store 104, with a general goal of one flash read per lookup. As described below, there is provided a mechanism for resolving collisions, which in one implementation is based upon a variant of cuckoo hashing. Also described is storing compact key signatures in memory, which allows balancing between RAM usage versus false flash reads.

Another aspect is directed towards destaging recently unused key-value pairs from the flash store 104 to the hard disk store 106, such as when RAM or flash bottlenecks are reached, to eliminate the need for rehashing. To this end, a RAM read cache 112 (e.g., of fixed size) provides a read cache of recently read items that is maintained in RAM. A least recently used policy (or other suitable mechanism) evicts key-value pairs when inserting items into a full cache.

Also shown in FIG. 1 is a pair of destaging bloom filters 114 (or other suitable data structure), which is used by a flash recycling thread as described below to determine to a high probability whether a valid key-value pair on flash has been recently accessed. As is known, a bloom filter is a probabilistic data structure in which false positives are possible, which are acceptable in this usage scenario. If determined to be recently accessed, a key-value pair is reinserted into the RAM write buffer 108 (where it will be written back to the flash store 104); otherwise the pair is destaged to the hard disk store 106. A disk-presence bloom filter 116 (or other suitable data structure) is used to record the keys that are destaged to the hard disk store 106, as also described below. This (to a high probability) avoids looking up non-existent keys, and thereby avoids hard disk access latencies.
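
A minimal bloom filter sketch follows, showing the false-positive/no-false-negative behavior relied on above. The bit-array size, number of hash functions and the SHA-1-based hashing scheme are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter; sizes and hash scheme are illustrative only."""
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: bytes):
        # may return a false positive, never a false negative
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# usage in the spirit of the disk-presence filter 116: record destaged keys
# so that lookups for keys that were never destaged can skip the disk access
disk_presence = BloomFilter()
disk_presence.add(b"some-key")
assert b"some-key" in disk_presence
```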

Various real-world applications may use this flash-based technology as an underlying persistent key-value store. For example, online multi-player gaming technology allows people from geographically diverse regions to participate in the same game. The number of concurrent players in such a game may range from tens to hundreds of thousands, and the number of concurrent game instances offered by a single online service may range from tens to hundreds. Key-value pairs are thus used in such an online multi-player gaming application, with high throughput and low latency being desirable for the get-set key operations. At the same time, persistency is desirable for purposes of resuming a game from an interrupted state if and when crashes occur, for offline analysis of game popularity, progression, and dynamics with the objective of improving the game, and/or verification of player actions for fairness when outcomes are associated with monetary rewards. The flash-based technology described herein meets these needs.

FIGS. 2-4 are example block/flow diagrams explaining the sequence of accesses in key lookup and insert operations, e.g., via client-called APIs 120 (FIG. 1), given the hierarchical relationship of the different storage areas. As represented in FIG. 2, a key lookup operation (get) first looks for a key in the RAM read cache 112 (step 202). Step 204 evaluates the cache hit or miss; if there is a cache hit (step 204), the process branches ahead to step 224 to return the associated value. If there is a miss, the process continues to step 206.

Step 206 looks for the key in the RAM write buffer 108. Upon a miss (step 208), the process searches the RAM hash table index 110 at step 210 in an attempt to locate the key on the flash store 104. Upon a miss (step 212), step 214 looks up the key in the disk-presence bloom filter 116. If the key is not present, step 216 branches to step 222 to return null. Otherwise, step 218 searches the hard disk store 106 for the key, where it is ordinarily present as indicated by the disk-presence bloom filter 116. However, if the key is not found (e.g., the bloom filter returned a false positive), step 220 branches to step 222 to return null.

As represented by step 224 and FIG. 3, if the key is found at any place other than the RAM read cache, the key-value pair is inserted into the RAM read cache (step 306). Note that via steps 302 and 304, if the read cache is full, a suitable key (e.g., the least recently used) is evicted. At step 308, data representing the key is also inserted into the destaging bloom filter 114 to indicate that it has been recently accessed, as described below. Step 226 of FIG. 2 returns the value.
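
A hedged sketch of this lookup (get) path follows. Plain dicts and sets stand in for the RAM read cache, RAM write buffer, hash table index, flash store, hard disk store and bloom filters; the names and step mapping are illustrative, not an API defined by the description above.

```python
read_cache = {}            # RAM read cache 112
write_buffer = {}          # RAM write buffer 108
hash_index = {}            # RAM hash table index 110: key -> flash pointer
flash_pages = {}           # flash store 104: flash pointer -> (key, value)
hard_disk = {}             # hard disk store 106
disk_presence = set()      # stand-in for disk-presence bloom filter 116
recently_accessed = set()  # stand-in for destaging bloom filter pair 114

def lookup(key):
    value = read_cache.get(key)                    # steps 202/204
    if value is None:
        value = write_buffer.get(key)              # steps 206/208
    if value is None:
        ptr = hash_index.get(key)                  # steps 210/212
        if ptr is not None:
            value = flash_pages[ptr][1]            # one flash read
    if value is None:
        if key not in disk_presence:               # steps 214/216
            return None
        value = hard_disk.get(key)                 # steps 218/220
        if value is None:                          # bloom filter false positive
            return None
    if key not in read_cache:                      # steps 302-308
        read_cache[key] = value                    # LRU eviction omitted here
        recently_accessed.add(key)
    return value
```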

Turning to a key insert (update/set) operation as represented in FIG. 4, step 402 writes the key-value pair (together with its timestamp) into the RAM write buffer 108. If an earlier value of the key exists in the RAM read cache 112, as evaluated by step 404, it will be invalidated at step 406.

As represented by step 408, when there are enough key-value pairs in the RAM write buffer to fill a flash page, a page of these entries is written to flash and inserted into the RAM hash table index at step 412. Also shown in FIG. 4 (optional step 410) is writing the write buffer to flash when a coalesce time interval threshold is met, that is, when less than a page exists. Note that such timed writing is likely event driven and performed by a separate process (or thread), but is shown in FIG. 4 for completeness. Such a timed writing to the flash store 104 provides for persistency by ensuring that any key written to RAM is persisted within the coalesce time, to handle situations in which few keys are being written and thus the page does not fill rapidly enough. The coalesce time may be configurable.
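
A companion sketch of the insert (set) path follows, using the same dict stand-ins as the lookup sketch (redeclared here so this block runs on its own). The page capacity and coalesce interval values are illustrative assumptions.

```python
import time

read_cache, write_buffer = {}, {}    # stand-ins for items 112 and 108
hash_index, flash_pages = {}, {}     # stand-ins for items 110 and 104
write_timestamps = {}                # per-entry write timestamps
PAGE_CAPACITY = 64                   # key-value pairs per flash page (assumed)
COALESCE_SECONDS = 1.0               # persistency bound for buffered writes (assumed)
last_flush = time.monotonic()

def insert(key, value):
    write_buffer[key] = value                       # step 402
    write_timestamps[key] = time.time()
    if key in read_cache:                           # steps 404/406: invalidate stale copy
        del read_cache[key]
    maybe_flush()

def maybe_flush(force=False):
    """Steps 408-412: write a page when full, or on coalesce timeout."""
    global last_flush
    full = len(write_buffer) >= PAGE_CAPACITY
    timed_out = time.monotonic() - last_flush >= COALESCE_SECONDS
    if not write_buffer or not (full or timed_out or force):
        return
    page_no = len(flash_pages)                      # append to the log
    for key in list(write_buffer):
        flash_pages[(page_no, key)] = (key, write_buffer.pop(key))
        hash_index[key] = (page_no, key)            # point index at the new copy
    last_flush = time.monotonic()
```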

In a typical usage scenario, eventually the pages in the flash store 104 will begin to fill up. When this occurs, e.g., when flash usage exceeds a certain threshold (e.g., eighty percent), any previously used, valid keys are maintained as described below and the page is evicted/recycled for use. Recycling may also be based on the RAM hash table index usage; for example, when the hash table index 110 exceeds a target maximum load factor (e.g., ninety percent), recycling may be performed to bring the usage below this threshold. In such a scenario, the flash store 104 serves as a cache for the much larger hard disk store 106. Different recycling operations may be applied to determine which keys and values are stored in the flash store 104, and which keys and values are destaged to the hard disk store 106.

One recycling operation considers currently used flash pages in oldest-first order. On each page, the key-value pairs are scanned to determine whether they are valid or not. A key-value pair on a flash page is invalid (or orphaned) if the record in the hash table index 110 for that key does not point to this entry on this flash page, which happens when the key was written again at a later time.

Another recycling policy is the least recently used (LRU) policy. In such a case, each key-value pair has a flag, which is updated every time the key-value pair is accessed. When the flash store 104 or the RAM index reaches a desired occupancy level, least recently used (LRU) key-value pairs are destaged to the hard disk store 106.

Yet another recycling policy is the first in, first out (FIFO) policy. In this case, the first key-value pair that was written to the flash store 104 is evicted when the flash store 104 or the RAM index reaches a desired occupancy level. FIFO is simpler to implement compared with LRU, but is less accurate in retaining the working set in the flash store 104.

As described above and in general, the pages on the flash store 104 are used in a circular linked list order, and the oldest pages are evicted/recycled after determining how to handle the valid key-value pairs of each such page. To this end, a recycle/eviction mechanism 122 (algorithm), generally represented in the flow diagram of FIG. 5, processes valid keys either by reinserting them into the flash store 104 (by reinserting the key-value pair into the RAM write buffer 108, where they will be later paged to the flash store 104) or by destaging them to the hard disk store 106.

In the example of FIG. 5, when the flash store 104 and/or hash table index 110 reaches a threshold usage level as determined via step 502, step 504 finds a page (e.g., the oldest) for recycling/eviction. Steps 506 and 518 select the keys for processing, generally by discarding any invalid key (steps 508 and 510), or otherwise destaging each key (step 514) to the hard disk store 106 or reinserting each key into the write buffer (step 516), depending on the key's access pattern, as maintained in the destaging bloom filter pair 114 and evaluated by step 512. Again, note that a false positive is acceptable, because no data is lost, and at worst a key-value pair that was not recently accessed is handled as if it was recently accessed. Note that for keys destaged to the hard disk store 106, a small number of bits per entry may be stored and maintained in the hash table index 110 as described below.
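
The following is a hedged sketch of this page recycling flow, again using dict/set stand-ins (redeclared so the block runs alone); removing the index entry at recycle time is a design choice made for the sketch, not a step mandated by FIG. 5.

```python
hash_index, flash_pages = {}, {}
write_buffer, hard_disk = {}, {}
recently_accessed, disk_presence = set(), set()   # stand-ins for items 114 and 116

def recycle_page(page_no, page_entries):
    """Process every key-value pair on the oldest page, then free the page."""
    for key, value in page_entries:                  # steps 506/518
        ptr = hash_index.get(key)
        if ptr != (page_no, key):
            continue                                 # steps 508/510: invalid/orphaned entry
        del hash_index[key]
        if key in recently_accessed:                 # step 512 (false positive is acceptable)
            write_buffer[key] = value                # step 516: stays in flash, via the buffer
        else:
            hard_disk[key] = value                   # step 514: destage
            disk_presence.add(key)
    # step 520: the page is now free for reuse in the circular log
```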

In one implementation, the access pattern is maintained in a rotating pair of destaging bloom filters 114 (FIG. 1) in the RAM 102 that interchange between themselves as the currently used one. Each bloom filter is dimensioned to record the last b recently accessed unique keys, where b is a parameter that determines the amount of access history that is maintained (chosen to be larger than the cardinality of the current working set of key-value pairs; known methods for estimating working set size may be used). The current bloom filter and a counter are each initialized to zero. When a key is accessed, the key is inserted into the current bloom filter and the counter is incremented if the key was not already in the bloom filter. Upon hitting the value of b unique accesses associated with the current bloom filter, the counter is reset to zero and usage switches to the other bloom filter (after reinitializing it). During the flash recycling operation, the flash recycling thread checks both bloom filters in RAM to determine the access pattern. The false positive property of a bloom filter makes the eviction policy more conservative; that is, if the presence of a key in the bloom filter is a false positive event, then that key is retained in the flash store 104 (by way of the write buffer 108) even though it was not actually accessed recently, although it may be destaged in subsequent flash recycling iterations.
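
A small sketch of the rotating pair follows. Python sets stand in for the bloom filters (a real filter would also admit false positives), and the capacity b is an illustrative parameter.

```python
class RotatingAccessFilter:
    """Rotating pair of recent-access filters, switched every b unique keys."""
    def __init__(self, b):
        self.b = b                       # unique keys recorded per filter
        self.filters = [set(), set()]    # stand-ins for the bloom filter pair 114
        self.current = 0
        self.counter = 0

    def record_access(self, key):
        cur = self.filters[self.current]
        if key not in cur:               # only previously unseen keys advance the counter
            cur.add(key)
            self.counter += 1
        if self.counter >= self.b:       # switch to the other filter
            self.current ^= 1
            self.filters[self.current] = set()   # reinitialize before reuse
            self.counter = 0

    def recently_accessed(self, key):
        # the flash recycling thread consults both filters
        return any(key in f for f in self.filters)
```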

Once the keys of a page have been processed in this way, step 520 evicts the page from the flash store 104, whereby it is again available (recycled) for use in the circular list; that is, the first page number variable is incremented. As represented by step 522, this eviction/recycling operation may be done until the threshold is met, or may be done for multiple pages to drop some percentage (e.g., ten percent) below the threshold, e.g., when the flash store threshold usage level is reached, process N pages so that the threshold is not met every time a single page is written.

The hard disk store 106 thus serves to store the key-value pairs that have been evicted from the flash store 104 because of page recycling. Because key lookups can miss in RAM and flash, the hard disk store 106 may be indexed to provide efficient access to the keys stored therein. In one implementation, a known embedded key-value database is used for indexing.

In addition to the insert and lookup operations, the write-time ordered log-based storage organization in the flash store supports queries for retrieving the keys that have been modified since a given time t. To process such a query, the system locates the earliest flash page written at a time equal to or later than t, and scans the keys in the pages starting from that page up to the last valid page in logical page number order. Keys having a write timestamp less than t are discarded from the results; note that they may appear in these pages because of being reinserted as a result of page recycling.
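
A short sketch of this modified-since query follows; it assumes pages carry their write time in a header and entries carry per-key write timestamps, as described earlier, and the page representation is an illustrative stand-in.

```python
def keys_modified_since(t, pages):
    """pages: list of (page_write_time, [(key, write_timestamp), ...]) in
    logical page number order, oldest first."""
    results = []
    # locate the earliest page written at or after time t
    start = next((i for i, (page_time, _) in enumerate(pages) if page_time >= t),
                 len(pages))
    for _page_time, entries in pages[start:]:
        for key, write_ts in entries:
            # reinserted (recycled) entries keep their original timestamps,
            # so entries older than t are discarded from the results
            if write_ts >= t:
                results.append(key)
    return results
```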

Turning to additional details of the hash table index 110, the hash table index 110 is structured as an array of slots. In one implementation, hash collisions, in which multiple keys map to the same hash table index slot, are resolved using a variant of cuckoo hashing. To this end, cuckoo hashing provides flexibility for each key to be in one of n ≥ 2 positions; this keeps the linear probing chain sequence upper bounded at n. Note that cuckoo hashing increases hash table load factors while keeping lookup time bounded to a constant.

In the variant of cuckoo hashing used in the example implementation, n random hash functions h₁, h₂, . . . , hₙ are used to obtain n candidate positions for a given key x. These candidate position indices for key x are obtained from the lower-order bit values of h₁(x), h₂(x), . . . , hₙ(x), corresponding to a modulo operation.

During insertion, the key is inserted in the first available candidate slot. When all slots for a given key x are occupied during insertion (e.g., by keys y₁, y₂, . . . , yₙ), room can be made for key x by relocating keys yᵢ in these occupied slots, because each key yᵢ may be placed in a choice of (n−1) other locations. Note that in the original cuckoo hashing scheme, a recursive strategy is used to relocate one of the keys yᵢ; however, in a worst case, this strategy may take many key relocations or get into an infinite loop, the probability for which can be shown to be very small and decreasing exponentially in n. In the variant described herein, the process attempts a small number of key relocations, after which, if unsuccessful, the process makes room by picking a key to destage to the hard disk store 106. In practice, by dimensioning the hash table index 110 for a certain load factor and by choosing a suitable value of n, such events can be made extremely rare.
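
The sketch below illustrates this bounded-relocation cuckoo variant. The table size, number of hash functions, relocation budget and SHA-1-based hash family are illustrative assumptions.

```python
import hashlib
import random

NUM_SLOTS = 1 << 16
N_CHOICES = 4                         # n candidate positions per key
MAX_RELOCATIONS = 8                   # small bounded relocation budget

table = [None] * NUM_SLOTS            # each slot holds (key, flash_pointer) or None
destaged = {}                         # stand-in for destaging to the hard disk store 106

def hash_i(i, key: bytes) -> int:
    return int.from_bytes(hashlib.sha1(bytes([i]) + key).digest(), "little")

def candidates(key: bytes):
    # lower-order bits of h_i(x), i.e., a modulo operation, give the slot index
    return [hash_i(i, key) % NUM_SLOTS for i in range(N_CHOICES)]

def insert(key: bytes, pointer):
    for _ in range(MAX_RELOCATIONS):
        slots = candidates(key)
        for s in slots:                           # first available candidate slot
            if table[s] is None:
                table[s] = (key, pointer)
                return
        victim_slot = random.choice(slots)        # all candidates occupied: relocate one
        victim_key, victim_ptr = table[victim_slot]
        table[victim_slot] = (key, pointer)
        key, pointer = victim_key, victim_ptr     # now try to re-place the victim
    destaged[key] = pointer                       # give up: destage instead of recursing

def lookup(key: bytes):
    for s in candidates(key):
        entry = table[s]
        if entry is not None and entry[0] == key:
            return entry[1]
    return None
```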

In an optimization, the amount of RAM usage per slot may be reduced by storing compact key signatures. Note that conventional hash table designs store the respective key in each entry of the hash table index. Depending on the application, the key size may range from a few tens of bytes (e.g., 20-byte SHA-1 hash) to hundreds of bytes or more. Given that RAM size is limited (on the order of gigabytes), if the full key is stored in each entry of the RAM hash table index, this may cause a bottleneck with respect to the maximum number of entries in the hash table index before the flash storage capacity bounds are reached. Conversely, if a key is not stored in the hash table index 110, the search operation on the hash table index 110 needs to follow hash table index pointers to the flash store 104 to determine whether the key stored in that slot matches the search key. This may lead to relatively many false flash reads, which are expensive, as flash access speeds are two to three orders of magnitude slower than that of RAM.

To approach maximizing hash table index capacity (the number of entries) while minimizing false flash reads, one implementation stores a compact key signature (on the order of a few bytes, e.g., two bytes) in each entry of the hash table index 110. This signature is derived from both the key and the candidate position number at which the key is stored. When a key x is stored in its candidate position number i, the signature in the respective hash table index slot is derived from the higher-order bits of the hash value hᵢ(x). During a search operation, when a key y is looked up in its candidate slot number j, the respective signature is computed from hⱼ(y) and compared with the signature stored in that slot. Only if a match happens is the pointer to the flash store followed to check whether the full key matches. The percentage of false reads is relatively low.
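
The following sketch shows one way slot index and signature could be taken from the same hash value: the low-order bits pick the slot, the high-order bits form the two-byte signature. The slot count and signature width are illustrative assumptions.

```python
import hashlib

NUM_SLOTS = 1 << 16          # slot index from the low-order bits (assumed)
SIGNATURE_BITS = 16          # two-byte compact signature

def hash_i(i, key: bytes) -> int:
    # 160-bit SHA-1 value of (position number, key)
    return int.from_bytes(hashlib.sha1(bytes([i]) + key).digest(), "big")

def slot_and_signature(i, key: bytes):
    h = hash_i(i, key)
    slot = h % NUM_SLOTS                                  # lower-order bits
    signature = h >> (160 - SIGNATURE_BITS)               # higher-order bits
    return slot, signature

# during lookup, only a matching signature triggers the flash read
slot, sig = slot_and_signature(0, b"example-key")
probe_slot, probe_sig = slot_and_signature(0, b"example-key")
assert (probe_slot, probe_sig) == (slot, sig)
```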

Key-value pairs may be organized in the flash store 104 in a log-structure in the order of the respective write operations coming into the system. As described above, the hash table index 110 contains pointers to the key-value pairs stored in the flash store 104. One implementation uses a four-byte pointer, which is a combination of a page pointer and a page offset. By way of example, consider a 160 GB flash store with 4 KB pages, which is representative of contemporary devices. In this example, a page number may be specified with log₂(160 GB/4 KB) = 26 bits. The remaining six bits can be used for the in-page offset, which points to 128 B boundaries in a 4 KB page; the stored key-value pairs are thus aligned at 128 B boundaries. Note that a pointer having a value of all ones (binary) is used to indicate an empty hash table index slot.
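
A small sketch of packing and unpacking such a four-byte pointer follows, using the 26-bit page number and 6-bit, 128 B-granular offset from the example above; the function names are illustrative.

```python
PAGE_BITS = 26
OFFSET_BITS = 6
OFFSET_GRANULARITY = 128                              # bytes
EMPTY_SLOT = (1 << (PAGE_BITS + OFFSET_BITS)) - 1     # all ones marks an empty slot

def pack_pointer(page_number: int, byte_offset: int) -> int:
    assert byte_offset % OFFSET_GRANULARITY == 0      # pairs aligned at 128 B boundaries
    return (page_number << OFFSET_BITS) | (byte_offset // OFFSET_GRANULARITY)

def unpack_pointer(pointer: int):
    if pointer == EMPTY_SLOT:
        return None
    page_number = pointer >> OFFSET_BITS
    byte_offset = (pointer & ((1 << OFFSET_BITS) - 1)) * OFFSET_GRANULARITY
    return page_number, byte_offset

assert unpack_pointer(pack_pointer(12345, 3 * 128)) == (12345, 384)
```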

The flash store 104 may be designed to maximize the RAM hash table index capacity, because this determines the number of key-value pairs stored in the flash store 104 that can be accessed with one flash read. The RAM size for the hash table index 110 may be determined based upon the application requirements. For example, with a two-byte compact key signature and four-byte flash pointer per entry, a typical RAM usage of 4 GB for the hash table index 110 accommodates a maximum of about 715 million entries. Whether RAM or flash capacity becomes the bottleneck for storing the working set of keys on flash depends on the key-value pair size. With 64-byte key-value pairs, 715 million entries in the hash table index occupy 42 GB on flash, which is easily accommodated in contemporary flash devices. With multiple flash devices, additional RAM may be provided to fully utilize them. Conversely, with 1024-byte key-value pairs, the 715 million entries in the hash table index 110 need 672 GB of flash, whereby multiple flash devices are needed given contemporary flash device sizes.

Note that the functionalities of key lookup/insert operations, writing key-value pairs to the flash store 104 and updating the RAM hash table index, and/or recycling of flash pages (including reinserting/destaging key-value pairs) may be handled by separate threads in a multi-threaded architecture, as described below. Concurrency issues with shared data structures may arise in a multi-threaded design, and may be handled as also described below.

More particularly, to attempt to maximize throughput of key lookup and insert operations, the flash store mechanisms may be multi-threaded, with logical partitioning of system functionality across different threads. For example, one or more client serving threads may perform the key lookup/insert operations received from the client. For a write operation, the client serving thread is responsible for adding the key-value pair to the RAM write buffer; if the key already exists in the RAM read cache, it invalidates that entry. A flash writing thread writes the key-value pairs to the flash store, and removes these entries from the RAM write buffer. A flash recycling thread performs the recycling and destaging/reinsertion operations. One or more hard disk store management thread(s) may be used; e.g., the known “Berkeley DB” embedded key-value database, which has a multi-threaded architecture, may be used to store and index the destaged key-value pairs.

Concurrency issues with shared data structures arise in the multi-threaded design, which are handled through thread synchronization using locks. So that a thread does not block unless it needs to, locks may be employed at suitable levels of granularity, that is, for correct concurrent execution and to avoid busy waiting. The following table summarizes the type of access (read or write) that different threads need on each shared data structure and the type of lock with which it is protected:

Data Structure         Accessing Threads         Access Type   Lock Type
RAM write buffer       Client Serving Threads    Read/Write    Producer-Consumer-Reader
                       Flash Writing Thread      Read/Write
RAM hash table index   Client Serving Threads    Read          Reader-Writer
                       Flash Writing Thread      Read/Write
                       Flash Recycling Thread    Read/Write
RAM read cache         Client Serving Threads    Read/Write    Reader-Writer
RAM Bloom filters      Client Serving Threads    Read/Write    Reader-Writer
                       Flash Recycling Thread    Read/Write

The RAM read cache 112 is accessed by the client serving threads. As described above, a thread executing a key lookup operation reads the cache and, upon a miss, inserts the current key-value pair (read from elsewhere) after evicting another key-value pair (if the read cache was full). The RAM write buffer 108 has key-value pairs added to it by the client serving threads and the flash recycling thread; any such thread needs to block if the buffer is full. Also, the flash writing thread needs to block until the key-value pairs in the buffer are confirmed written to a flash page. Thus, the client serving/flash recycling threads and the flash writing thread have a producer-consumer relationship on the RAM write buffer. Moreover, the client serving threads also need to read the buffer upon a miss in the RAM read cache during a read key operation. Thus, the RAM write buffer needs to be protected by a combination of producer-consumer and reader-writer locks. Known synchronization techniques used separately for each of them are adapted to obtain a combined lock of the desired nature, referred to herein as a producer-consumer-reader lock.

The RAM hash table index 110 is read by the client serving threads during a read key operation and read/written by the flash writing and flash recycling threads, and thus is served by a reader-writer lock. However, to maximize the number of concurrent operations on the hash table index, it may be necessary to lock the hash table index 110 at the level of each entry, which if performed creates significant overhead associated with maintenance of so many locks. Conversely, using only one lock for the entire hash table index 110 minimizes the number of concurrent operations allowed, leading to unnecessary blocking of threads. In one implementation, a balance is provided by letting the hash table index have N slots, and partitioning the hash table index into m segments, with each segment having N/m contiguous slots; segment-level locks are then used. When a thread needs to access slot i of the hash table index, the thread obtains the appropriate type of lock (read or write) on the segment containing that slot, i.e., segment number ⌊i/(N/m)⌋. Under this design, two threads that need to respectively read and write different slots in the same segment need to compete for the same segment lock; thus the design compromises on maximum allowable concurrency to reduce the overhead from the number of locks.
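
A brief sketch of such segment-level locking follows. Python's standard library has no reader-writer lock, so a plain mutex per segment stands in for one here; the slot and segment counts are illustrative.

```python
import threading

N_SLOTS = 1 << 20
M_SEGMENTS = 256
SLOTS_PER_SEGMENT = N_SLOTS // M_SEGMENTS        # N/m contiguous slots per segment

segment_locks = [threading.Lock() for _ in range(M_SEGMENTS)]
slots = [None] * N_SLOTS

def segment_of(slot_index: int) -> int:
    return slot_index // SLOTS_PER_SEGMENT       # segment containing slot i

def write_slot(slot_index: int, entry):
    with segment_locks[segment_of(slot_index)]:  # writer lock on the segment
        slots[slot_index] = entry

def read_slot(slot_index: int):
    with segment_locks[segment_of(slot_index)]:  # reader lock on the segment
        return slots[slot_index]
```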

Another aspect is that the persistency guarantee enables the flash-based system to recover from system crashes, e.g., due to power failure or other reasons. Because the system logs the key-value write operations to flash, it is straightforward to rebuild the hash table index in RAM by scanning the valid flash pages on flash. Recovery using this method can take some time, however, depending on the total size of valid flash pages that need to be scanned and the read throughput of the flash memory. If crash recovery needs to be executed faster so as to support “near” real-time recovery, then the RAM hash table index may be occasionally/periodically checkpointed into flash (in a separate area from the key-value pair logs). For example, the recycling process treats the content stored in the secondary storage device (flash store) as a stream, and for each key-value pair in the flash store, checks whether it is pointed to by a pointer in the RAM index. If pointed to, the key-value pair is copied into a new stream, and garbage collection is performed on at least a portion of a previous stream. The RAM index is periodically checkpointed into a storage device in association with a current end position of the key-value store stream for use in crash recovery.

Recovery then involves reading the last written hash table index checkpoint from flash, scanning the key-value pair logged flash pages with timestamps after the checkpoint, and inserting them into the restored hash table index. During the operation of checkpointing the hash table index, the insert operations need to be suspended (although read operations by other threads may continue). The flash writing thread can continue with flash writing operations during this time but cannot insert items into the hash table index. A temporary, small in-RAM hash table may be used to provide an index for the interim items. After the checkpointing operation completes, any key-value pairs from the flash pages written in the interim are inserted into the hash table index. Key lookup operations, upon missing in the hash table index, check in these flash pages (via the small additional hash table) until the latter insertions into the hash table index are complete. The flash recycling thread is suspended during the hash table index checkpointing operation, since the recycling thread cannot set hash table index entries to null.

Note that by using known concepts, the flash store may be extended to multiple nodes. For example, one approach may use a one-hop distributed hash table (DHT) based on consistent hashing to map the key space across multiple nodes. An alternative approach is to use hash function-based partitioning of keys across nodes, with each node protected by buddy pair machines; note however that new nodes cannot be added easily, because a hash function does not have the locality-preserving redistribution properties of consistent hashing.

Turning to another aspect, storage deduplication refers to identifying duplicate data using disk-based indexes on chunk hashes, which has a number of benefits in computing, including using inline deduplication to provide high backup throughput. However, storage deduplication can create throughput bottlenecks due to the disk I/Os involved in index lookups. While known RAM prefetching and bloom filter based techniques help avoid disk I/Os on a high percentage (e.g., close to ninety-nine percent) of the index lookups, even at this reduced rate the index lookups that do go to disk cause potential problems.

The technology described herein is able to reduce the penalty of index lookup misses in RAM, typically by orders of magnitude, namely by serving such lookups from a flash memory-based index and thereby increasing inline deduplication throughput. The use of flash memory as described herein is able to reduce the significant gap between RAM and hard disk in terms of both cost and access times.

To this end, as generally represented in FIG. 6, a flash-based inline deduplication system using a chunk metadata store on a flash store 604 is provided. In one implementation, the system uses one flash read per chunk lookup and works with RAM prefetching strategies.

In general, and similar to the above-described flash-based key-value system, the deduplication system organizes chunk metadata in a log-structure on the flash store 604 to exploit fast sequential writes, while using an in-memory hash table index 610 to index them, with hash collisions resolved by the above-described variant of cuckoo hashing. Also similar to as described above, the in-memory hash table index 610 may store compact key signatures instead of full chunk hashes so as to balance tradeoffs between RAM usage and false flash reads. Further, by indexing a small fraction of chunks per container, the system can reduce RAM usage significantly with negligible loss in deduplication quality. One implementation of the system can index 6 TB of unique (deduplicated) data using 45 GB of flash.

In one implementation, data chunks coming into the system are identified by their SHA-1 hash, and, via a deduplication chunk handling mechanism 622 (described below with reference to FIG. 7), are looked up in an index of currently existing chunks in the system (for that storage location or stream). If a match is found, the metadata for the file (or object) containing that chunk is updated to point to the location of the existing chunk. If there is no match, the new chunk is stored in the system and the metadata for the associated file is updated to point to it. One implementation allocates 44 bytes for the metadata portion, with the 20-byte chunk hash comprising the key and the 44-byte metadata being the value, for a total key-value pair size of 64 bytes.

A Rabin fingerprinting-based sliding window hash may be used on the data stream to identify chunk boundaries in a content-dependent manner. A chunk boundary is declared when the lower-order bits of the Rabin fingerprint match a certain pattern. The length of the pattern can be adjusted to vary the average chunk size. The average chunk size in one system is 8 KB; Ziv-Lempel compression on individual chunks can achieve an average compression ratio of two to one, so that the size of the stored chunks on hard disk averages around 4 KB. The SHA-1 hash of a chunk serves as its chunk-id in the system described herein.
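
A hedged sketch of content-defined chunking follows. A simple polynomial rolling hash stands in for Rabin fingerprinting, and the window size, modulus and boundary mask are illustrative values chosen so the example runs quickly rather than to reproduce the 8 KB average above.

```python
import hashlib
import os

WINDOW = 48
BASE = 257
MOD = (1 << 61) - 1
BOUNDARY_MASK = (1 << 11) - 1        # ~2 KB average chunks for this demo
POW_W = pow(BASE, WINDOW, MOD)       # precomputed factor for removing the oldest byte

def chunk_boundaries(data: bytes):
    h, start = 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW_W) % MOD   # slide the window
        # declare a boundary when the low-order bits match the pattern (all zero)
        if i + 1 - start >= WINDOW and (h & BOUNDARY_MASK) == 0:
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)

data = os.urandom(1 << 16)
for lo, hi in chunk_boundaries(data):
    chunk = data[lo:hi]
    chunk_id = hashlib.sha1(chunk).hexdigest()   # the SHA-1 hash serves as the chunk-id
```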

The system may target complete deduplication and ensure that no duplicate chunks exist in the system after deduplication. However, a technique for RAM usage reduction that comes at the expense of marginal loss in deduplication quality may be provided.

A container store 606 on a hard disk manages the storage of chunks. In one implementation, each container stores at most 1024 chunks and averages in size around 4 MB. As new (non-duplicate) chunks come into the system, they are appended to a current container 640 buffered in RAM 602. When the current container 640 reaches a target size of 1024 chunks, it is sealed and written to hard disk, and a new (empty) container is opened for future use.

A RAM chunk metadata write buffer 608 (e.g., of fixed size) buffers the chunk metadata information for the currently open container 640. The buffer is written to flash when the current container is sealed, e.g., when the buffer accumulates 1024 chunk entries and reaches a size of 64 KB. The RAM write buffer 608 is sized to two-to-three times the flash page size so that chunk metadata writes can still go through when part of the buffer is being written to flash.

To eliminate hard disk accesses for chunk-id lookup, the flash store 604 maintains metadata for chunks maintained in the system, indexed with the RAM hash table index 610. A cache 612 for chunk metadata is also maintained in the RAM 602. The fetch (prefetch) and eviction policies may be executed at the container level (i.e., metadata for all chunks in a container).

To implement such a container-level prefetch and eviction policy, a RAM container metadata cache 642 (e.g., fixed-size) for the chunk metadata may be maintained for the containers whose chunk metadata is currently held in RAM; this cache 642 maps a container-id to the chunk-ids it contains. In one implementation, the size of this container cache 642 determines the size of the chunk metadata cache, as a container has 1024 chunks. For a RAM chunk metadata cache eviction strategy, the container metadata cache 642 in RAM may follow a least recently used (LRU) replacement policy. When a container is evicted from this cache, its containing chunk-ids are removed from the chunk metadata cache 612. Note that the deduplication system does not need to use bloom filters to avoid hard disk lookups for non-existent chunks.

With respect to a prefetching strategy, the predictability of sequential chunk-id lookups during second and subsequent full backups may be used in a known manner. Because datasets do not change much across two backups, duplicate chunks in a current full backup are very likely to appear in the same order as they did in the previous backup. As a result, when the metadata for a chunk is fetched from flash (upon a miss in the chunk metadata cache 612 in RAM 602), the system prefetches the metadata for the chunks in that container into the chunk metadata cache 612 in RAM and adds the associated container's entry to the RAM container metadata cache 642. Because of this prefetching strategy, it is generally likely that the next several hundreds or thousands of chunk lookups will hit in the RAM chunk metadata cache 612.

In one implementation, the chunk metadata storage is organized on flash into logical page units of 64 KB, which corresponds to the metadata for the chunks in a single container (1024 chunks at 64 bytes per chunk-id and metadata). The RAM hash table index is generally similar to that described above, as the index maintains pointers to the pairs (of chunk-id, metadata) stored on the flash store 604. As described above, collisions may be resolved using a variant of cuckoo hashing, while compact key signatures may be maintained in memory to trade off between RAM usage and false flash reads.

FIG. 7 summarizes the hierarchical relationship of the different storage areas in the deduplication system, via a flow diagram showing a sequence of accesses during inline deduplication. When a new chunk comes into the system, its SHA-1 hash is first looked up to determine if the chunk is a duplicate one. If not, the new chunk-id is inserted into the system.

In the flash-based deduplication system, a chunk-id lookup operation looks up the RAM chunk metadata cache 612, as represented by step 702. If found, it is a duplicate chunk (step 716), and is handled accordingly, e.g., the file/object pointer is updated to point to the existing chunk.

Upon a miss in the RAM chunk metadata cache 612, at step 704 the mechanism 622 looks for the key in the RAM chunk metadata write buffer 608, and if found, branches to step 716. If missed, at step 706 the mechanism 622 searches the RAM hash table index 610 in an attempt to locate the chunk-id in the flash store 604. If the chunk-id is present in the flash store 604, at step 708 its metadata, together with the metadata of the chunks in the respective container, is prefetched into the RAM chunk metadata cache, and the chunk is handled as a duplicate at step 716.
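
A hedged sketch of this inline deduplication lookup path follows, with dicts standing in for the RAM chunk metadata cache 612, the RAM chunk metadata write buffer 608, the RAM hash table index 610 and the chunk metadata log on flash; the names and container representation are illustrative.

```python
chunk_metadata_cache = {}      # chunk-id -> metadata (RAM, filled at container granularity)
metadata_write_buffer = {}     # chunk-id -> metadata for the currently open container
hash_index = {}                # chunk-id -> flash pointer
flash_metadata_log = {}        # flash pointer -> (container_id, {chunk-id: metadata})

def lookup_chunk(chunk_id):
    """Return metadata if the chunk is a duplicate, else None (new chunk)."""
    meta = chunk_metadata_cache.get(chunk_id)            # step 702
    if meta is not None:
        return meta                                      # step 716: duplicate
    meta = metadata_write_buffer.get(chunk_id)           # step 704
    if meta is not None:
        return meta
    ptr = hash_index.get(chunk_id)                       # step 706
    if ptr is None:
        return None                                      # new chunk (step 710 path)
    container_id, container_meta = flash_metadata_log[ptr]   # one flash read
    # step 708: prefetch metadata for the whole container into the RAM cache
    chunk_metadata_cache.update(container_meta)
    return container_meta[chunk_id]
```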

A chunk-id insert operation happens when the chunk coming into the system has not been seen earlier, as represented by step 710. Step 710 represents a number of operations, including writing the chunk metadata into the RAM chunk metadata write buffer; the chunk itself is appended to the currently open container buffered in RAM.

As evaluated by step 712, when the number of chunk entries in the RAM chunk metadata write buffer reaches the target (e.g., of 1024) for the current container, at step 714 the container is sealed and written to the container store on hard disk, and its associated chunk metadata entries are written to the flash store 604 and inserted into the RAM hash table index 610.

With respect to RAM and flash capacity considerations, the deduplication system is designed to use a small number of bytes in RAM per entry so as to maximize the RAM hash table index capacity for a given RAM usage size. The RAM hash table index capacity determines the number of chunk-ids stored on flash whose metadata can be accessed with one flash read. The RAM size for the hash table index 610 can be determined with application requirements in mind. With a two-byte compact key signature and a four-byte flash pointer per entry, for a total of six bytes per entry, a typical RAM usage of 4 GB per machine for the hash table index accommodates a maximum of about 715 million chunk-id entries. At an average of 8 KB size per data chunk, this accommodates about 6 TB of deduplicated data. With 64 bytes allocated for a chunk-id and its metadata, this corresponds to about 45 GB of chunk metadata.

For efficient inline deduplication, the entire chunk metadata for the (current) backup dataset is fit into the flash store 604. Otherwise, when space on flash runs out, the append log needs to be recycled and written from the beginning. When a page on the flash log is rewritten, the earlier one needs to be evicted and the metadata contained therein written out to a hard disk-based index; then, during the chunk-id lookup process, if the chunk is not found in flash, it will need to be looked up in the index on hard disk. Thus, unless fit into the flash store 604, both the chunk-id insert and lookup pathways potentially suffer from the same bottlenecks as disk index based systems.

As described herein, the system uses flash memory to store chunk metadata and index it from RAM, while providing flexibility for flash memory to serve, or not serve, as a permanent location for chunk metadata for a given storage location. This decision can be driven by cost considerations, for example, because of the difference in cost between flash memory and hard disk. The chunk metadata log on flash can be written to hard disk in one large sequential write (single disk I/O) at the end of the backup process. At the beginning of the next full backup for this storage location, the chunk metadata log can be loaded back into flash from hard disk in one large sequential read (single disk I/O) and the containing chunks can be indexed in the RAM hash table index. This mode of operation amortizes the storage cost of metadata on flash across many backup datasets.

With respect to reducing the system RAM usage, the largest portion of RAM usage in the system comes from the hash table index 610. This usage can be reduced by indexing in RAM only a small fraction of the chunks at the beginning of each container (instead of the whole container). Note that the flash memory continues to hold metadata for all chunks in all containers, not just the ones indexed in RAM. Further, note that indexing chunks at the beginning of a container (versus uniformly at random over the container, for example) has benefits, including that, because of the sequential predictability of chunk-id lookups during second and subsequent full backups, the first few chunks in a container are effective predictors that the next several hundreds or thousands of chunks in the incoming stream will come from this container. As a result, the benefit of prefetching container metadata is the highest when one of its first few chunks is accessed. However, when only a subset of the chunks stored in the system are indexed in the RAM hash table index, detection of duplicate chunks is not completely accurate, i.e., some incoming chunks that are not found in the RAM hash table index may have appeared earlier and are already stored in the system. This will lead to some loss in deduplication quality in that some amount of duplicate data chunks will be stored in the system. However, the quality reduction tends to be marginal with respect to the reduction in RAM usage, and thus this tradeoff is useful in many situations.

Exemplary Operating Environment

FIG. 8 illustrates an example of a suitable computing and networking environment 800 on which the examples of FIGS. 1-7 may be implemented. The computing system environment 800 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 800 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 800.

The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.

With reference to FIG. 8, an exemplary system for implementing various aspects of the invention may include a general purpose computing device in the form of a computer 810. Components of the computer 810 may include, but are not limited to, a processing unit 820, a system memory 830, and a system bus 821 that couples various system components including the system memory to the processing unit 820. The system bus 821 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.

The computer 810 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 810 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 810. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.

The system memory 830 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 831 and random access memory (RAM) 832. A basic input/output system 833 (BIOS), containing the basic routines that help to transfer information between elements within computer 810, such as during start-up, is typically stored in ROM 831. RAM 832 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 820. By way of example, and not limitation, FIG. 8 illustrates operating system 834, application programs 835, other program modules 836 and program data 837.

The computer 810 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 8 illustrates a hard disk drive 841 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 851 that reads from or writes to a removable, nonvolatile magnetic disk 852, and an optical disk drive 855 that reads from or writes to a removable, nonvolatile optical disk 856 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 841 is typically connected to the system bus 821 through a non-removable memory interface such as interface 840, and magnetic disk drive 851 and optical disk drive 855 are typically connected to the system bus 821 by a removable memory interface, such as interface 850.

The drives and their associated computer storage media, described above and illustrated in FIG. 8, provide storage of computer-readable instructions, data structures, program modules and other data for the computer 810. In FIG. 8, for example, hard disk drive 841 is illustrated as storing operating system 844, application programs 845, other program modules 846 and program data 847. Note that these components can either be the same as or different from operating system 834, application programs 835, other program modules 836, and program data 837. Operating system 844, application programs 845, other program modules 846, and program data 847 are given different numbers herein to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 810 through input devices such as a tablet, or electronic digitizer, 864, a microphone 863, a keyboard 862 and pointing device 861, commonly referred to as mouse, trackball or touch pad. Other input devices not shown in FIG. 8 may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 820 through a user input interface 860 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 891 or other type of display device is also connected to the system bus 821 via an interface, such as a video interface 890. The monitor 891 may also be integrated with a touch-screen panel or the like. Note that the monitor and/or touch screen panel can be physically coupled to a housing in which the computing device 810 is incorporated, such as in a tablet-type personal computer. In addition, computers such as the computing device 810 may also include other peripheral output devices such as speakers 895 and printer 896, which may be connected through an output peripheral interface 894 or the like.

The computer 810 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 880. The remote computer 880 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 810, although only a memory storage device 881 has been illustrated in FIG. 8. The logical connections depicted in FIG. 8 include one or more local area networks (LAN) 871 and one or more wide area networks (WAN) 873, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 810 is connected to the LAN 871 through a network interface or adapter 870. When used in a WAN networking environment, the computer 810 typically includes a modem 872 or other means for establishing communications over the WAN 873, such as the Internet. The modem 872, which may be internal or external, may be connected to the system bus 821 via the user input interface 860 or other appropriate mechanism. A wireless networking component such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 810, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 8 illustrates remote application programs 885 as residing on memory device 881. It may be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

An auxiliary subsystem 899 (e.g., for auxiliary display of content) may be connected via the user interface 860 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 899 may be connected to the modem 872 and/or network interface 870 to allow communication between these systems while the main processing unit 820 is in a low power state.

CONCLUSION

While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.

1. In a computing environment, a system comprising, a storage mechanism configured to maintain data items in pages, with at least some pages in a secondary storage device; a RAM-based index; and the storage mechanism accessing the RAM-based index to determine whether a data item is retrievable, the index returning information corresponding to one or more pages in which the data item is maintained, or returning information indicating that the data item cannot be found.
2. The system of claim 1 wherein the RAM-based index has a compact footprint and includes a truncated cuckoo hash table, in which each entry of the index comprises a compact checksum and a pointer to a page, the index of each data item configured for storage in one of a plurality of locations in the table, and wherein the checksum validates whether the data item is stored in the page.
3. The system of claim 1 wherein the data items comprise key-value pairs, with the key and associated value comprising arbitrary byte arrays.
4. The system of claim 1 wherein the secondary storage device comprises a non-volatile memory device or a hard drive device, or both a non-volatile memory device and a hard drive device.
5. The system of claim 1 wherein the storage mechanism writes data items to the storage device, including writing data items to a page and inserting a compact index of the data items into the RAM-based index.
6. The system of claim 1, wherein the storage mechanism further comprises a RAM-based write buffer that maintains data items to be written to the secondary storage, and wherein the storage mechanism writes a page of data items from the RAM-based write buffer to the secondary storage when the data items fill a page, or writes less than a page of data items from the RAM-based write buffer to the secondary storage device when a coalesce time is reached.
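By way of illustration only, and not by way of limitation, the following is a minimal sketch (in Python) of a RAM-based write buffer consistent with claim 6: data items accumulate in RAM and are written out either when they fill a page or when a coalesce time is reached. The class and parameter names, the page size and the coalesce interval are illustrative assumptions rather than part of the claimed subject matter.

    import time

    PAGE_SIZE = 64            # assumed number of data items per flash page
    COALESCE_SECONDS = 5.0    # assumed coalesce time

    class WriteBuffer:
        """RAM-based write buffer that flushes a full page when it fills,
        or a partial page when the coalesce time is reached (claim 6)."""

        def __init__(self, flush_page):
            self.items = []                  # pending (key, value) data items
            self.last_flush = time.time()
            self.flush_page = flush_page     # callback that writes one page to flash

        def put(self, key, value):
            self.items.append((key, value))
            self.maybe_flush()

        def maybe_flush(self):
            full = len(self.items) >= PAGE_SIZE
            timed_out = time.time() - self.last_flush >= COALESCE_SECONDS
            if self.items and (full or timed_out):
                page, self.items = self.items[:PAGE_SIZE], self.items[PAGE_SIZE:]
                self.flush_page(page)        # full page, or partial page on timeout
                self.last_flush = time.time()

In practice, maybe_flush would also be invoked from a timer so that a partial page is written even when no new items arrive before the coalesce time elapses.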
7. The system of claim 1 wherein the secondary storage device comprises a flash store, and further comprising disk-based storage and a recycle mechanism that makes a page in the flash store again available for use by destaging at least some of the data items from the flash store to the disk-based storage.
8. The system of claim 7, wherein the recycle mechanism includes an oldest first policy, least recently used policy, or first-in, first-out policy.
9. The system of claim 1 wherein the secondary storage device comprises a flash store, and further comprising a data structure that includes information that indicates to a high probability whether a data item has been recently accessed, and a recycle mechanism that makes a page in the flash store available by processing valid data items on the page, including destaging a data item from the page in the flash store to a disk-based storage or reinserting the data item into a RAM-based write buffer to be written back to the flash store, based on whether the information in the data structure indicates that the data item has been recently accessed.
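As a further non-limiting sketch, the recycle step of claim 9 may be pictured as follows (Python; the helper names and the form of the recent-access test are assumptions): each valid data item on the page being recycled is either reinserted into the RAM-based write buffer, so that it returns to the flash store, or destaged to disk, according to whether the recent-access data structure indicates the item was recently accessed.

    def recycle_page(page_items, recently_accessed, write_buffer, destage_to_disk):
        """Make a flash page available again (claim 9). 'recently_accessed' is
        assumed to consult the probabilistic recent-access data structure;
        'write_buffer' is a RAM-based write buffer with a put() method; and
        'destage_to_disk' writes a data item to the disk-based storage."""
        for key, value in page_items:
            if recently_accessed(key):
                write_buffer.put(key, value)      # keep the hot item in the flash store
            else:
                destage_to_disk(key, value)       # move the cold item to disk
        # The page may now be erased and reused.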
10. The system of claim 1 further comprising a RAM-based read/write cache above the secondary storage, and wherein the storage mechanism looks up a data item in the RAM-based read/write cache before accessing the RAM-based index to locate data items in the secondary storage.
11. In a computing environment, a method performed on at least one processor, comprising: maintaining key-value pairs in a secondary storage device; maintaining a RAM-based index with a compact footprint that contains information for locating the key-value pairs maintained in the secondary storage device; and looking for a key by accessing the RAM-based index to look for one or more locations of the key-value pair in the secondary storage device.
12. The method of claim 11, wherein writing the key-value pairs comprises writing a key-value pair to the secondary storage device, and adding an entry into the RAM-based index for that key.
13. The method of claim 12 wherein looking for the key of the key-value pair is performed by a first thread, wherein writing the key-value pair is performed by a second thread, and wherein a recycling process is performed by a third thread, in which at least one of the threads uses a locking mechanism.
14. The method of claim 11, wherein the RAM-based index comprises a truncated cuckoo hash table, wherein each entry of the index comprises a compact checksum and an additional pointer, wherein each key may be stored in one of multiple locations in the index with locations determined by multiple hash functions, wherein a compact checksum is calculated for each location and key, wherein looking up the key comprises checking in multiple locations in the index whether the stored checksum matches the checksum of the key, and wherein the pointers of any locations with a checksum match are each returned as a pointer to a location in which the key-value pairs can be stored.
15. The method of claim 14, wherein when the key does not exist in the storage device no pointers are retrieved, or if one or more pointers are retrieved, the method further comprises checking content pointed to by the pointers to validate if the key is stored in the location, and if so, returning the value of the key-value pair.
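The lookup recited in claims 14 and 15 (and the index layout of claim 2) can be sketched as follows, purely by way of example; the hash function, the number of candidate locations, the table size and the 16-bit checksum width are illustrative assumptions. Each index entry is a (checksum, pointer) pair; a lookup probes each candidate location, compares the stored compact checksum with the checksum of the key for that location, and validates any match by reading the pointed-to page.

    import hashlib

    NUM_CHOICES = 2          # assumed number of candidate locations per key
    TABLE_SLOTS = 1 << 20    # assumed number of index slots

    def _hash(key, i):
        digest = hashlib.sha1(bytes([i]) + key).digest()
        return int.from_bytes(digest[:8], "big")

    def slot(key, i):
        """The i-th candidate location of a key, from the i-th hash function."""
        return _hash(key, i) % TABLE_SLOTS

    def checksum(key, i):
        """Compact per-location checksum of the key (16 bits assumed)."""
        return (_hash(key, i) >> 48) & 0xFFFF

    def lookup(index, key, read_page):
        """Look up a key (claims 14 and 15). 'index' maps a slot number to a
        (checksum, pointer) entry; 'read_page' reads the pointed-to page and
        returns its key-value pairs as a dict."""
        for i in range(NUM_CHOICES):
            entry = index.get(slot(key, i))
            if entry and entry[0] == checksum(key, i):   # checksum match at this location
                page = read_page(entry[1])               # follow the candidate pointer
                if key in page:                          # validate against stored content
                    return page[key]
        return None                                      # key not present (claim 15)

Because the checksum is compact, distinct keys can occasionally collide on both location and checksum; the validation read against the page content resolves such false matches.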
16. The method of claim 11, wherein some of the key-value pairs are stored in RAM, and wherein the pointer is divided into a first subspace and a second subspace, in which the first subspace points to a location in RAM, and the second subspace points to a location in the secondary storage.
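A pointer whose value space is split as in claim 16 might, for instance, reserve one bit to select the subspace; the 48-bit width and the flag position in the sketch below are purely illustrative assumptions.

    RAM_FLAG = 1 << 47    # assumed: top bit of a 48-bit pointer selects the RAM subspace

    def make_pointer(offset, in_ram):
        """Encode a location as a pointer in one of two subspaces (claim 16):
        the RAM subspace or the secondary-storage subspace."""
        return offset | RAM_FLAG if in_ram else offset

    def resolve_pointer(pointer):
        """Decode a pointer into (offset, is_ram)."""
        return pointer & (RAM_FLAG - 1), bool(pointer & RAM_FLAG)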
17. The method of claim 11, wherein writing the key-value pairs into the secondary storage device includes appending the key-value pair to a logical end of the secondary storage device, retrieving one or more pointers of the existing key in the RAM-based index, checking existing pointers to determine if a previous version of the key-value pair is stored in the secondary storage device and RAM-based index, and if a previous version exists, replacing the pointer to the previous version of the key-value pair with the pointer to a new version of the key-value pair and rendering the previous version of the key-value pair as not pointed to by any pointers in the RAM-based index to thereby be processed by a recycling process, and if no previous version exists, storing the pointer of a new version of the key-value pair in an unoccupied location if one exists, and if no unoccupied location is found, relocating a pointer stored in a location to an alternative location or destaging the pointer and the associated key-value pair to another storage device.
18. The method of claim 17, wherein relocating the pointer includes: (a) retrieving keys associated with the pointer as potential relocation candidates, (b) finding an alternative location for each of the relocation candidates, (c) if an unoccupied alternative location is found, relocating the relocation candidate, and (d) if all alternative locations of all relocation candidates are occupied, adding the keys at all the alternative locations as relocation candidates and repeating from step (a).
19. The method of claim 17 wherein the recycling process treats the content stored in the secondary storage device as a stream, and for each key-value pair in the secondary storage device, checks if it is pointed to by a pointer in the RAM-based index, and if pointed to, performs: copying the key-value pair into a new stream; garbage collecting at least a portion of a previous stream; and periodically checkpointing the RAM-based index into a storage device in association with a current end position of the key-value store stream for use in crash recovery.
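The recycling process of claim 19 can be sketched as a single pass over the old stream (Python; the callback names and the checkpoint interval are assumptions): live key-value pairs, i.e., those still pointed to by the RAM-based index, are copied into a new stream, the remainder of the old stream becomes garbage, and the index is periodically checkpointed together with the current end position of the stream so that a crash can be recovered from the last checkpoint.

    def recycle_stream(old_stream, is_pointed_to, copy_to_new_stream,
                       checkpoint_index, checkpoint_every=1024):
        """One pass of the recycling process of claim 19.
        'old_stream' yields (position, key, value) tuples in stream order;
        'is_pointed_to' consults the RAM-based index; 'copy_to_new_stream'
        appends a pair and returns the new end position of the stream;
        'checkpoint_index' persists the index with that end position."""
        end_position = 0
        for count, (position, key, value) in enumerate(old_stream, start=1):
            if is_pointed_to(key, position):                 # still live in the index
                end_position = copy_to_new_stream(key, value)
            if count % checkpoint_every == 0:
                checkpoint_index(end_position)               # for crash recovery
        checkpoint_index(end_position)
        # The portion of the old stream processed so far may now be
        # garbage collected.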
20. In a computing environment, a system comprising, a secondary storage device, a compact RAM-based index corresponding to data items in the secondary storage device, and a mechanism that resolves RAM-based index collisions comprising more than one data item having a common storage location with a common checksum in the RAM-based index, the mechanism resolving the collision by moving at least one index entry to another location that does not correspond to a collision, or if no other location is found after one or more attempts, by destaging a data item from the secondary storage device to a third storage device and removing a corresponding index entry for that data item from the hash table index.
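Finally, the insertion, relocation and collision handling of claims 17, 18 and 20 may be illustrated by the following sketch, which reuses the slot(), checksum() and NUM_CHOICES helpers from the lookup sketch after claim 15. It shows only the placement of a new pointer (not the replacement of a previous version), it performs a single level of relocation rather than the full breadth-first repetition recited in claim 18, and the key_at() and destage() helpers are assumptions rather than claimed elements.

    def insert(index, key, pointer, key_at, destage):
        """Place the pointer for a key (claims 17, 18 and 20). 'index' maps a
        slot to a (checksum, pointer) entry; 'key_at' recovers the key stored
        at a slot (e.g., by reading the pointed-to page); 'destage' moves a
        key and its pointer to another storage device."""
        # 1. Use an unoccupied candidate location if one exists (claim 17).
        for i in range(NUM_CHOICES):
            s = slot(key, i)
            if s not in index:
                index[s] = (checksum(key, i), pointer)
                return
        # 2. Try to relocate one occupant to one of its own alternative
        #    locations (claim 18, shown here for one level only).
        for i in range(NUM_CHOICES):
            s = slot(key, i)
            occupant = key_at(s)
            for j in range(NUM_CHOICES):
                alt = slot(occupant, j)
                if alt not in index:
                    index[alt] = (checksum(occupant, j), index[s][1])   # move the occupant
                    index[s] = (checksum(key, i), pointer)              # reuse its slot
                    return
        # 3. No free alternative after the attempted moves: destage one
        #    colliding item, remove its index entry, and reuse the slot
        #    (claims 17 and 20).
        victim_slot = slot(key, 0)
        destage(key_at(victim_slot), index.pop(victim_slot)[1])
        index[victim_slot] = (checksum(key, 0), pointer)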